library(ggplot2)
library(GGally)
library(plotly)

Question 2

d.)

Xh <- c(1, 0, 0, 0)
XtXinv <- rbind(
  c( 0.087, -0.014, -0.035, -0.004),
  c(-0.014,  0.115, -0.012, -0.052),
  c(-0.035, -0.012,  0.057, -0.014),
  c(-0.004, -0.052, -0.014,  0.050)
)
XtXinv
##        [,1]   [,2]   [,3]   [,4]
## [1,]  0.087 -0.014 -0.035 -0.004
## [2,] -0.014  0.115 -0.012 -0.052
## [3,] -0.035 -0.012  0.057 -0.014
## [4,] -0.004 -0.052 -0.014  0.050
SE.predh <- sqrt(1.040 * (1 + t(Xh) %*% XtXinv %*% Xh))
SE.predh
##         [,1]
## [1,] 1.06324
qt(1 - (0.05 / 2), 30 - 4)
## [1] 2.055529
c(0.9918 - (qt(1 - (0.05 / 2), 30 - 4) * SE.predh), 0.9918 + (qt(1 - (0.05 / 2), 30 - 4) * SE.predh))
## [1] -1.193722  3.177322
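
The steps above can be wrapped in a small helper. This is only a sketch, with `pred_interval` a made-up name; the point estimate 0.9918, MSE of 1.040, n = 30, and p = 4 are taken from the work above.

```r
# Sketch: prediction interval for a new observation at Xh,
# given MSE, (X'X)^{-1}, and the point estimate Yh.hat
pred_interval <- function(Yh.hat, Xh, XtXinv, MSE, n, p, alpha = 0.05) {
  se.pred <- sqrt(MSE * (1 + t(Xh) %*% XtXinv %*% Xh))  # s{pred}
  tcrit   <- qt(1 - alpha / 2, n - p)                   # t(1 - alpha/2, n - p)
  c(lower = Yh.hat - tcrit * se.pred,
    upper = Yh.hat + tcrit * se.pred)
}

Xh <- c(1, 0, 0, 0)
XtXinv <- rbind(
  c( 0.087, -0.014, -0.035, -0.004),
  c(-0.014,  0.115, -0.012, -0.052),
  c(-0.035, -0.012,  0.057, -0.014),
  c(-0.004, -0.052, -0.014,  0.050)
)
pred_interval(0.9918, Xh, XtXinv, MSE = 1.040, n = 30, p = 4)
## reproduces the interval (-1.193722, 3.177322) computed above
```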

Question 6

a.)

property <- read.table("property.txt")
colnames(property) <-
  c("Ren.Rate", "Age", "Exp", "Vac.Rate", "Sq.Foot")
property

For the variable types, the summaries below suggest that Rental Rate, Operating Expenses, and Vacancy Rate take non-integer values, while Age and Total Square Footage appear to be whole numbers. Now we observe the distribution of each variable and some summary statistics.

ggplot(data = property, aes(x = Ren.Rate)) + geom_histogram(bins = 10, fill = "#55A3C0")

summary(property$Ren.Rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.50   14.00   15.00   15.14   16.50   19.25

We can see approximate symmetry with light tails in the Rental Rate variable. The mean and the median are very close to each other.

ggplot(data = property, aes(x = Age)) + geom_histogram(bins = 10, fill = "#556EC0")

summary(property$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   4.000   7.864  15.000  20.000

For the Age variable, we can see a bimodal distribution: the majority of the data sits between 0 and 5 years and between 10 and 20 years. Because of this split, the median is pulled toward the lower cluster, where the larger share of the data lies, while the mean sits closer to the center of the range, pulled upward by the concentration of larger values.

ggplot(data = property, aes(x = Exp)) + geom_histogram(bins = 13, fill = "#7255C0")

summary(property$Exp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   8.130  10.360   9.688  11.620  14.620

With our Operating Expenses variable, we observe a left-skewed distribution. Most of the data falls within the IQR, from about 8.13 to 11.62.

ggplot(data = property, aes(x = Vac.Rate)) + geom_histogram(bins = 9, fill = "#7255C0")

summary(property$Vac.Rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.03000 0.08099 0.09000 0.73000

In the Vacancy Rate data, we can see a very strong right skew. Most of the data does not exceed 0.1, and because of the skew, the median is smaller than the mean.
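
The median-below-mean pattern under a strong right skew is easy to demonstrate with simulated data (this sketch uses exponential draws, not the property data):

```r
# Right-skewed sample: the long right tail pulls the mean above the median
set.seed(1)
x <- rexp(1000, rate = 1)
c(mean = mean(x), median = median(x))
```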

ggplot(data = property, aes(x = Sq.Foot)) + geom_histogram(bins = 7, fill = "#7255C0")

summary(property$Sq.Foot)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27000   70000  129614  160633  236000  484290

With our Square Footage variable, we again see a right skew. Unlike Vacancy Rate, the skew is less extreme: the distribution has more mass toward the center, along with a heavier tail.

b.)

ggplotly(ggpairs(property, title = "Correlogram of Property Data"))

From the plot, we can see that the pair of variables with the highest correlation is Rental Rate and Square Footage, while the pair with the smallest correlation is Rental Rate and Vacancy Rate. This makes sense: the scatter plot for Rental Rate and Square Footage looks the most linear, while Rental Rate and Vacancy Rate show almost no relationship. We can also see that Vacancy Rate is negatively correlated with both Age and Operating Expenses.

c.)

model1 <- lm(Ren.Rate ~ Age + Exp + Vac.Rate + Sq.Foot, data = property)
summary(model1)
## 
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Vac.Rate + Sq.Foot, data = property)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1872 -0.5911 -0.0910  0.5579  2.9441 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.220e+01  5.780e-01  21.110  < 2e-16 ***
## Age         -1.420e-01  2.134e-02  -6.655 3.89e-09 ***
## Exp          2.820e-01  6.317e-02   4.464 2.75e-05 ***
## Vac.Rate     6.193e-01  1.087e+00   0.570     0.57    
## Sq.Foot      7.924e-06  1.385e-06   5.722 1.98e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.137 on 76 degrees of freedom
## Multiple R-squared:  0.5847, Adjusted R-squared:  0.5629 
## F-statistic: 26.76 on 4 and 76 DF,  p-value: 7.272e-14

The least squares estimates are the following:

model1$coefficients
##   (Intercept)           Age           Exp      Vac.Rate       Sq.Foot 
##  1.220059e+01 -1.420336e-01  2.820165e-01  6.193435e-01  7.924302e-06

The fitted regression function: \[ \hat{Y}_i = 12.201 - 0.142 X_{i1} + 0.282X_{i2} + 0.619X_{i3} + (7.924\times 10^{-6})X_{i4} \]

The \(R^2\) for the model is \(0.5847\) and the \(R^2_{adj}\) is \(0.5629\).
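
As a sanity check, \(R^2_{adj}\) can be recovered from \(R^2\) via \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p}\). A quick sketch using the n = 81 observations and p = 5 estimated coefficients behind the summary:

```r
# Adjusted R-squared from R-squared, sample size n, and coefficient count p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p)
adj_r2(0.5847, n = 81, p = 5)
## close to the 0.5629 reported by summary(), up to rounding of R-squared
```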

We can print the ANOVA table to gather more information.

#MSE
anova(model1)

The ANOVA table shows us that our \(MSE = 1.293\).

d.)

#Residuals vs Fitted Values
ggplot() + aes(x = model1$fitted.values, model1$residuals) + geom_point() + labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs. Fitted Values") + geom_abline(intercept = 0, slope = 0)

# Normal Q-Q Plot
ggplot(model1, aes(sample = model1$residuals)) + stat_qq() + stat_qq_line() + labs(x = "Theoretical Quantiles", y = "Sample Quantiles", title = "Normal Q-Q")

# Residual Boxplot
ggplot() + aes(x = model1$residuals) + geom_boxplot() + labs(x = "Residuals") 

We can see from the Residuals vs. Fitted Values plot that the residuals scatter randomly with no sign of a pattern, suggesting that a linear model is appropriate for this data. The Normal Q-Q plot shows that the residual distribution is approximately normal with signs of light tails. The box plot shows that the residuals are approximately symmetric, with only 4 outliers in the data set.

e.)

#Residuals vs Age
ggplot() + aes(x = property$Age, y = model1$residuals) + geom_point() + labs(x = "Age", y = "Residuals", title = "Residuals vs. Age") + geom_abline(intercept = 0, slope = 0)

# Residuals vs Operating Expenses
ggplot() + aes(x = property$Exp, y = model1$residuals) + geom_point() + labs(x = "Operating Expenses", y = "Residuals", title = "Residuals vs. Operating Expenses") + geom_abline(intercept = 0, slope = 0)

# Residuals vs Vacancy Rate
ggplot() + aes(x = property$Vac.Rate, y = model1$residuals) + geom_point() + labs(x = "Vacancy Rate", y = "Residuals", title = "Residuals vs. Vacancy Rate") + geom_abline(intercept = 0, slope = 0)

# Residuals vs Square Footage
ggplot() + aes(x = property$Sq.Foot, y = model1$residuals) + geom_point() + labs(x = "Square Footage", y = "Residuals", title = "Residuals vs. Square Footage") + geom_abline(intercept = 0, slope = 0)

### Interactions 

#Residuals vs Age*Expenses
ggplot() + aes(x = property$Age * property$Exp, y = model1$residuals) + geom_point() + labs(x = "Age*Expenses", y = "Residuals", title = "Residuals vs. Age*Expenses") + geom_abline(intercept = 0, slope = 0)

#Residuals vs Age*Vac.Rate
ggplot() + aes(x = property$Age * property$Vac.Rate, y = model1$residuals) + geom_point() + labs(x = "Age*Vac.Rate", y = "Residuals", title = "Residuals vs. Age*Vac.Rate") + geom_abline(intercept = 0, slope = 0)

#Residuals vs Age*Sq.Foot
ggplot() + aes(x = property$Age * property$Sq.Foot, y = model1$residuals) + geom_point() + labs(x = "Age*Sq.Foot", y = "Residuals", title = "Residuals vs. Age*Sq.Foot") + geom_abline(intercept = 0, slope = 0)

#Residuals vs Expenses*VacRate
ggplot() + aes(x = property$Exp * property$Vac.Rate, y = model1$residuals) + geom_point() + labs(x = "Exp*Vac.Rate", y = "Residuals", title = "Residuals vs. Exp*Vac.Rate") + geom_abline(intercept = 0, slope = 0)

#Residuals vs Expenses*Sq.Foot
ggplot() + aes(x = property$Exp * property$Sq.Foot, y = model1$residuals) + geom_point() + labs(x = "Exp*Sq.Foot", y = "Residuals", title = "Residuals vs. Exp*Sq.Foot") + geom_abline(intercept = 0, slope = 0)

#Residuals vs Vac.Rate*Sq.Foot
ggplot() + aes(x = property$Vac.Rate * property$Sq.Foot, y = model1$residuals) + geom_point() + labs(x = "Vac.Rate*Sq.Foot", y = "Residuals", title = "Residuals vs. Vac.Rate*Sq.Foot") + geom_abline(intercept = 0, slope = 0)

We have six interaction terms. In the plots involving Vacancy Rate, many of the observations sit very close to zero, so the points do not scatter randomly; instead they concentrate on the left with outliers to the right. The interactions involving Total Square Footage show the same pattern, though less pronounced.

In our non-interaction plots, we can see that variables Age and Vacancy Rate have clusters of data, which violates our desire for random scatter.

f.)

To test our Beta coefficients, we have: \[ H_0: \beta_k = 0 \\ H_a: \beta_k \neq 0 \] With test statistic and null distribution: \[ T^* = \frac{\hat{\beta_k}}{s(\hat{\beta_k})} \\ T^* \sim t_{n-p} \]

We reject \(H_0\) if \(|T^*| > t(1 - \alpha/2, n-p)\).

Thankfully, the summary of our model does this test for us with the p-values. For the intercept, age, operating expenses, and square footage, with respective p-values (< 2e-16, 3.89e-09, 2.75e-05, 1.98e-07), we reject \(H_0\) and conclude that those coefficients are significantly different from 0. However, for Vacancy Rate, with p-value 0.57, we fail to reject \(H_0\), leading us to conclude that Vacancy Rate does not help us predict Rental Rate. We conclude that all other variables are useful in modeling Rental Rate.
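
For example, the Age row of the summary can be reproduced from its estimate and standard error (df = n - p = 76); a quick sketch:

```r
# t statistic and two-sided p-value for the Age coefficient
b.age  <- -1.420e-01   # estimate from summary(model1)
se.age <-  2.134e-02   # standard error from summary(model1)
tstar  <- b.age / se.age
pval   <- 2 * pt(abs(tstar), df = 76, lower.tail = FALSE)
c(tstar = tstar, pval = pval)
## tstar is approximately -6.655, matching the summary row
```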

g.)

anova(model1)

\[ SSE = 98.231 \\ df(SSE) = 76 \\ SSR = 14.819 + 72.802 + 8.381 + 42.325 = 138.327\\ df(SSR) = 4\\ SSTO = 98.231 + 138.327 = 236.558\\ df(SSTO) = 80 \]
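
The decomposition above can be checked directly from the sequential sums of squares in the ANOVA table:

```r
# Sequential SS for Age, Exp, Vac.Rate, Sq.Foot plus the residual SS,
# taken from anova(model1), reconstruct SSR and SSTO
ss.seq <- c(Age = 14.819, Exp = 72.802, Vac.Rate = 8.381, Sq.Foot = 42.325)
sse    <- 98.231
ssr    <- sum(ss.seq)
ssto   <- ssr + sse
c(SSR = ssr, SSTO = ssto)
## SSR = 138.327, SSTO = 236.558
```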

We test the hypothesis for the regression relation:

\[ H_0: \beta_1 = \dots = \beta_{p-1} = 0 \\ H_a: \text{Not all } \beta_k \text{ are equal to zero} \]

\[ F^* = \frac{MSR}{MSE} \\ F^* \sim_{H_0} F_{(p-1, n-p)} \]

We reject \(H_0\) if \(F^* > F(1 - \alpha, p-1, n-p)\).

Fstar <- (138.327 / 4) / (98.231 / 76)
Fstar
## [1] 26.75543
pf(Fstar, 4, 76, lower.tail = FALSE)
## [1] 7.272536e-14

Since our p-value is much smaller than our \(\alpha = 0.01\), we reject the null hypothesis and conclude that not all of the \(\beta_k\) equal zero.

h.)

model2 <- lm(Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
model2$coefficients
##   (Intercept)           Age           Exp       Sq.Foot 
##  1.237058e+01 -1.441646e-01  2.671670e-01  8.178210e-06

The fitted regression function for model2: \[ \hat{Y}_i = 12.371 - 0.144 X_{i1} + 0.267X_{i2} + (8.178\times 10^{-6})X_{i4} \]

We decided to fit a new model without \(X_3\) since we concluded in part f that \(X_3\) was not significant in modeling rental rates.

summary(model2)
## 
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0620 -0.6437 -0.1013  0.5672  2.9583 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.237e+01  4.928e-01  25.100  < 2e-16 ***
## Age         -1.442e-01  2.092e-02  -6.891 1.33e-09 ***
## Exp          2.672e-01  5.729e-02   4.663 1.29e-05 ***
## Sq.Foot      8.178e-06  1.305e-06   6.265 1.97e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.132 on 77 degrees of freedom
## Multiple R-squared:  0.583,  Adjusted R-squared:  0.5667 
## F-statistic: 35.88 on 3 and 77 DF,  p-value: 1.295e-14
anova(model2)

The value of \(R^2 = 0.583\) and the \(R^2_{adj} = 0.5667\). The \(MSE = 1.281\).

i.)

sumModel1 <- summary(model1)
sumModel2 <- summary(model2)

# Print the std errors 
sumModel1$coefficients[,2]
##  (Intercept)          Age          Exp     Vac.Rate      Sq.Foot 
## 5.779562e-01 2.134261e-02 6.317235e-02 1.086813e+00 1.384775e-06
sumModel2$coefficients[,2]
##  (Intercept)          Age          Exp      Sq.Foot 
## 4.928469e-01 2.092012e-02 5.729487e-02 1.305377e-06

We can see that the standard error of each coefficient is smaller in model 2. If we had constructed intervals for model 1 using the larger standard errors, those intervals would be wider than the corresponding intervals for model 2.

tval <- qt(1 - (0.01/2), nrow(property) - 4)

#confidence intervals 
t(
  rbind(
    sumModel2$coefficients[, 1] - tval * sumModel2$coefficients[, 2],
    sumModel2$coefficients[, 1] + tval * sumModel2$coefficients[, 2]
  )
)
##                      [,1]          [,2]
## (Intercept)  1.106888e+01  1.367229e+01
## Age         -1.994188e-01 -8.891046e-02
## Exp          1.158399e-01  4.184941e-01
## Sq.Foot      4.730452e-06  1.162597e-05

j.)

# Prediction for model 1
predict(model1, data.frame(Age = 4, Exp = 10, Vac.Rate = 0.1, Sq.Foot = 80000), interval = "prediction", level = 0.99)
##        fit      lwr      upr
## 1 14.51518 11.42732 17.60305
# Prediction for model 2
predict(model2, data.frame(Age = 4, Exp = 10, Sq.Foot = 80000), interval = "prediction", level = 0.99)
##        fit      lwr      upr
## 1 14.46625 11.40128 17.53121
# Range for interval 1
17.60305 - 11.42732
## [1] 6.17573
# Range for interval 2
17.53121 - 11.40128
## [1] 6.12993

We can see that the interval for model2 is slightly narrower than the interval for model1.

k.)

Overall, we would prefer to use model 2: it has a lower MSE, statistically significant predictors for every variable in the model, and a higher \(R^2_{adj}\) value. A model with fewer variables is also easier to interpret.